Introduction to R

Module 1

Author
Affiliation

Matthew G. Son

University of South Florida

The R Language

Ecosystem

  • Easy and intuitive to learn
  • Robust quantitative finance packages
  • Robust statistical & econometric packages
  • State-of-the-art visualizations
  • Fast and efficient modern packages
  • Versatile
    • Python (reticulate), C++ (Rcpp)
    • ML (keras, h2o), Apache Spark (sparklyr), Arrow (arrow)

Great visualizations

Fast statistical packages

Excellent syntax integration

Write once, run everywhere: dplyr syntax that can be used in:

  • arrow for fast out-of-core data

  • dbplyr for databases (duckdb, SQLite, etc.)

  • sparklyr for spark integration

  • duckplyr for duckdb backend

  • and others (dtplyr, tidypolars)

Introduction to File Systems

What is a Path?

  • A path is the address of a file or folder

  • Types of paths:

    • Absolute path: exact location from the root
    • Relative path: relative location from current working directory

Absolute Paths

  • Begins from the root directory (/ on Mac/Linux, C:\ on Windows)

    • Mac/Linux: /Users/username/Documents/project/data.csv
    • Windows: C:\Users\username\Documents\project\data.csv
  • Unambiguous

  • Not dependent on the current working directory

  • OS dependent (Windows vs macOS & Linux)

Relative Paths

  • The location relative to the current working directory.

    • Example: data/project/data.csv
  • Portable across systems

  • Easier to manage in projects
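Whichever path style you use, it helps to build paths with R's file.path() rather than pasting separators by hand; a minimal sketch:

```r
# file.path() joins path components with the platform's separator
rel_path <- file.path("data", "project", "data.csv")
print(rel_path)
# [1] "data/project/data.csv"
```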

Working Directory

The working directory is the folder where R reads and writes files by default

  • getwd() shows the current working directory
  • setwd("/path/to/directory") to change

Home Directory

  • The “base” directory for the user in the operating system
  • Used to store personal files

Windows: username/Documents folder

  • The default home directory for R on Windows is C:\Users\<username>\Documents.
  • User john’s home directory would be C:\Users\john\Documents.

Mac / Linux : username folder

  • /Users/<username>
  • e.g. /Users/john

Tilde ~

Represents the user’s home directory.

In R, ~ refers to

  • C:\Users\<username>\Documents for Windows
  • /Users/<username> for Mac/Linux

Example: ~/cases refers to:

  • C:\Users\<username>\Documents\cases for Windows
  • /Users/<username>/cases for Mac/Linux
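You can check what ~ expands to on your machine with path.expand():

```r
# path.expand() replaces the leading ~ with your home directory
path.expand("~/cases")
# the result depends on your OS and username, e.g. "/Users/john/cases"
```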

R Basics

Basic Calculations

You can use R to do basic math calculations

1 / 200 * 30
[1] 0.15
(59 + 73 + 2) / 3
[1] 44.66667

Concatenate multiple elements

To concatenate multiple elements into a vector, use the c() function:

c(1, 2) # a vector (2 elements)
[1] 1 2
c(2, 3, 4, 6, 7) # a vector with 5 elements
[1] 2 3 4 6 7
Tip

An (atomic) vector in R is an ordered collection of elements of the same type.

Vectorized calculations

You can perform vectorized calculations easily in R:

c(1, 3, 5) + c(2, 3, 4) # plus on each element of vector
[1] 3 6 9
c(5, 6, 7, 8) * 3 # multiply each (broadcast x3 to all elements)
[1] 15 18 21 24
Tip

Vectorized operations are much faster than looping over each element.

In R, most operations are automatically vectorized.

Assigning values

Use <- to assign a value to a symbol (name).

my_number <- 3 # object of value 3 is created, with its symbol my_number
print(my_number * 3)
[1] 9

Here, my_number is called the symbol, or name, of the object.

Tip

Style guide: although = also works for assignment, use <-.

Use = for specifying function arguments instead.

Some IDEs (e.g. RStudio / Positron / VS Code) provide Alt (Option) + - as a shortcut.


R has strict rules about a syntactic name (symbol).

  • It is case sensitive
a <- 3
print(A)
Error: object 'A' not found
  • It cannot contain whitespace
my number <- 1 # error
Error in parse(text = input): <text>:1:4: unexpected symbol
1: my number
       ^
  • It cannot start with numbers
my_number_1 <- 15
1_my_number <- 10 # error
Error in parse(text = input): <text>:2:2: unexpected input
1: my_number_1 <- 15
2: 1_
    ^

  • You can’t use reserved words like TRUE, NULL, if, etc.

  • If you deliberately need a non-syntactic name, wrap it in backticks `

# Use backtick escaping only when you have to
`my number` <- 1
print(`my number`)
[1] 1

Object naming conventions

Since symbols cannot contain whitespace, two naming conventions are popular.

  • snake_case

  • camelCase

when_you_have_very_long_name_object <- 12
YouCanNameItLikeThis <- 1
Tip

It is better to make a short, self-explanatory name.

e.g. weight <- 15 is easier to understand than my_variable_quantity <- 15

R Prompt

  • On the console, > means: “Waiting for your command”

  • + means: “Continuing your command”

  • Hit Esc (or Ctrl + C in a terminal) to abort


> cat("Here's command I'm typing
+ but it was unfinished
+ so R waits until I finish expression...
+ 
+ hitting Enter doesn't abort at all..
+
+ press ESC or CTRL+C to abort 
")

Object types in R

  • Vector type: common data type

  • Special type (non-vector): everything else

    • functions, environments, etc.

Vectors are the most important family of data types in R.

Vector type

A vector is a data structure that stores multiple elements. It comes in two flavors:

  • Atomic vector: all elements same type
  • Generic vector: known as list, can have different types of elements

NULL is not a vector, but it often serves as a zero-length vector.

Atomic vectors

There are four primary types of atomic vector in R, and two others.

Type of Atomic vectors

  1. Logical (or Boolean): TRUE, FALSE, NA
logical_test <- 3 > 5
print(logical_test)
[1] FALSE
typeof(logical_test)
[1] "logical"

  2. Integer: integer numbers

Append the suffix L to treat the number as a strict integer.

my_integer <- 3L # L specifies the number is integer
typeof(my_integer)
[1] "integer"

  3. Double: real numbers
my_number <- 3.125 # A length of 1 vector
typeof(my_number)
[1] "double"
Caution

“numeric” is a collective term for both doubles and integers, but in practice it is often used as a synonym for “double” (real number).


  4. Character (or string): words, wrapped in " or '
my_name <- "Matthew Son" # wrap double quotes around
korean_name <- 'Gunsu Son' # or single quotes
typeof(my_name)
[1] "character"
typeof(korean_name)
[1] "character"
Tip

Style guide: Use double quote " for character instead of ' if possible.


  5. Two other types:
  • raw type: binary data type

  • complex type: complex numbers (e.g. 3 + 4i)

  • rarely needed in Finance

Missing values: NA

  • Missing values are denoted by NA
    • Not Available: similar to “undefined”
  • They are not identical to zero or NULL
    • NULL is an intentional empty “placeholder” in R
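A few quick checks illustrate how NA and NULL differ in practice:

```r
# NA propagates through calculations and comparisons
NA + 1        # NA
is.na(NA)     # TRUE
NA == NA      # NA -- use is.na(), not ==, to test for missingness

# NULL has zero length and vanishes when concatenated
length(NULL)  # 0
c(1, NULL, 2) # 1 2
is.null(NULL) # TRUE
```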

Exercise

  1. What are four primary types of atomic vector?

  2. What are the types of a,b,c,d below?

a <- TRUE
b <- 3.56
c <- "Logical"
d <- 6

Confirm your answer with typeof().

List

A list is a generic vector, i.e. a vector that is not atomic.

  • An atomic vector can have only one type for its elements (Double, Integer, Logical, …)

  • A list can hold multiple data types among its members (even other lists)

example_list <- list(1L, 3.5, 'Hi', TRUE)
print(example_list)
[[1]]
[1] 1

[[2]]
[1] 3.5

[[3]]
[1] "Hi"

[[4]]
[1] TRUE
typeof(example_list)
[1] "list"

Class attribute

The class attribute in R is used to define how objects behave with functions.

Especially important classes in Finance are:

  • Date, Time
  • Factors
  • Data frames
  • or your own custom-built class
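As a quick preview of factors, a factor stores categorical data as an integer vector plus a levels attribute (the credit-rating values here are just an illustration):

```r
# a factor: categorical data built on an integer vector
ratings <- factor(c("AAA", "BB", "AAA", "A")) # e.g. credit ratings
levels(ratings) # unique categories, sorted alphabetically
# [1] "A"   "AAA" "BB"
typeof(ratings) # "integer" underneath
class(ratings)  # the "factor" class drives its behavior
```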

Example: Date/Time

Very important class in Finance.

They are built from double-type atomic vectors (their type), but have their own specific rules for use (their class).

today <- Sys.Date() # returns today's date
now <- Sys.time() # returns time now
print(today)
[1] "2026-02-09"
print(now)
[1] "2026-02-09 17:56:06 EST"

Check their data type:

typeof(today)
[1] "double"
typeof(now)
[1] "double"

Check their attributes: they have class attributes.

attributes(today)
$class
[1] "Date"
attributes(now)
$class
[1] "POSIXct" "POSIXt" 

To directly access the class attribute:

class(today)
[1] "Date"
class(now)
[1] "POSIXct" "POSIXt" 

Class attribute and change of behavior

As an example, see how they work with the + function.

print(today + 1) # adds one day
[1] "2026-02-10"
print(now + 1) # adds one second
[1] "2026-02-09 17:56:07 EST"

Q: Why does + 1 yield different results?

A: Because they belong to different classes, so + 1 is interpreted differently.

The class attribute gives context for how an object should behave with functions.


Deep down, they are just numbers:

# strip off class attribute (only)
unclass(today)
[1] 20493
unclass(now)
[1] 1770677767
  • Date: The value of double represents the number of days since “1970-01-01” (Unix Epoch)
  • Time: the number of seconds since Unix Epoch
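You can verify the days-since-epoch rule directly:

```r
# "1970-01-01" is day 0 of the Unix Epoch, so "1970-01-10" is day 9
d <- as.Date("1970-01-10")
unclass(d)
# [1] 9
```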

Exercise

Class exercise

  1. Execute typeof(c(1,2,3)) and typeof(c(1L, 2L, 3L)). What’s the difference?
  2. Assign a vector with three elements: 1,3,5, and name it as my_first_object
  3. Assign another object with one element: 5, name it as MySecondObject
  4. Multiply my_first_object with MySecondObject. What do you get?
  5. Assign a vector with your name: my_name
  6. What do you get when you execute my_name + 3? Why do you get this result?

Importance of class

Since class determines the behavior of an object, it is crucial to know your data’s class, especially when performing function calls.

Note

Calling a function means executing/applying a function.

For example, you cannot use the add function on a character and a numeric:

# Another way to perform plus
`+`(3, 4) # good
[1] 7
`+`(3, "Hi") # error
Error in 3 + "Hi": non-numeric argument to binary operator

Functions

Defining a function

You can define custom function (User-defined function) in R with the following syntax:

my_first_function <- function(argument1, argument2) {# input arguments
  
  # write what the function should do with the arguments

  return() # output
}

The function can be called in prefix form:

my_first_function(argument1 = 1, argument2 = "hi")

Functions example 1

Our first function, c(), concatenates all the values and generates a single object!

c(1, 3, 5) # c() takes an arbitrary number of arguments (the ... part)
[1] 1 3 5
c(1, 3, 5, 7, 9)
[1] 1 3 5 7 9

Functions example 2

The seq() function generates a sequence of numbers.

  • It has three arguments: from, to, and by
# generate a vector of sequence from 1 to 10 by 2
seq(from = 1, to = 10, by = 2)
[1] 1 3 5 7 9
  • If users don’t specify argument names, inputs are matched by position
seq(1, 10, 2) # reads first three
[1] 1 3 5 7 9
  • If the user enters more arguments than the function accepts, it raises an error.
seq(1, 10, 2, 5)
Error in seq.default(1, 10, 2, 5): too many arguments

Getting Help on Functions

All R functions are built by someone, and documentation is typically provided.

For a detailed description of any function, use ? followed by the function’s name.

For example, try below code in your console:

?seq

Or, use help()

help(seq)

Build a Perpetuity Calculator

The present value of a perpetuity, where the cash flow grows at a constant rate g, is given by:

\[ PV_{PER} = \frac{PMT}{r - g} \]

where

  • PMT is the payment or cash flow.
  • r is the discount rate.
  • g is the growth rate of the cash flow.

This formula applies when r > g.

Defining a function

You can write your perpetuity function in R with the following syntax:

pv_per <- function(pmt, r, g) {
  # write what the function should do with the arguments
  n <- pmt
  d <- r - g

  # then return the result
  return(n / d)
}

Calling a function

Let’s call the function above:

  • What is the PV of perpetuity, when PMT = $10,000, r = 7% and g = 3%?
pv_per(pmt = 10000, r = 0.07, g = 0.03)
[1] 250000
  • Assign the result value of the function
answer <- pv_per(10000, 0.07, 0.03)
print(answer)
[1] 250000
  • Vector can be the input (vectorized)
pv_per(c(100, 1000, 10000), 0.07, 0.03) # r, g broadcasted
[1]   2500  25000 250000
pv_per(c(100, 1000, 10000), c(0.07, 0.05, 0.04), 0.03) # g broadcasted
[1]    2500   50000 1000000

Exercise

  1. Define a perpetuity calculator function, pv_per(). What is the pv when PMT = $50,000, r = 4%, g = 0%?

  2. What is the pv when PMT = $50,000, r = 4%, but g are 1%, 2%, 3%?

Default Arguments

What happens if user doesn’t specify one argument?

pv_per(10000, 0.07) # g is missing!
Error in pv_per(10000, 0.07): argument "g" is missing, with no default

You can set default values for arguments, allowing them to be omitted when calling the function.

# set default g to be zero
pv_per2 <- function(pmt, r, g = 0) {
  n <- pmt
  d <- r - g
  # then return the result
  return(n / d)
}
# prefix call without g specified
pv_per2(10000, 0.07)
[1] 142857.1

Anonymous function

# Anonymous function example
function(x, y) {
  return(x + y + 1)
} # function with no name

my_func <- function(x, y){
  return(x + y + 1)
} # function with name my_func

Functions are typically named so they can be reused multiple times.

However, you can skip naming a custom function; such functions are called anonymous functions.

  • Useful when the function is simple and called only once.

They are not stored in the environment since they have no assigned symbol (name).
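An anonymous function can even be defined and called in a single expression by wrapping the definition in parentheses:

```r
# define and immediately call an anonymous function
(function(x, y) x + y + 1)(2, 3)
# [1] 6
```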

Syntactic sugar: Anonymous Function

Note

Syntactic sugar refers to a programming feature that makes code simpler to read or write, without adding functionality.

(Anonymous) functions can be defined with syntactic sugar:

\(x, y) x + y + 1

\(x, y) {
  x + y + 1
} # or with curly braces

Exercise

Convert the perpetuity function below (pv_per) to an anonymous function:

pv_per <- function(pmt, r, g) {
  return(pmt / (r - g))
}

Syntactic sugar: Pipe Operator

A Motivating example

Solve the math problem below. Describe your steps. What were the first and last steps?

\[ \sqrt{(2+4)^2 - 3 \times 4} = ? \]


\[ \sqrt{(2+4)^2 - 3 \times 4} \]

  1. Do 2+4, square it, and save the result in your memory
  2. Do 3*4, subtract it from the previous result, and update your memory
  3. Then take the square root of the value
  • Similarly, code is often not written in the order we calculate.

  • It is easier to read & write code in the order it is executed.


When we have composite function calls such as

f(g(h(k(x))))

The call sequence is x -> k() -> h() -> g() -> f().

It is easier to read, write, and debug if we can write code like:

# Note: this is not a real code!
x then
  call k() on above and then
  call h() on above output and then 
  call g() on above output and then
  call f() on above output

Pipe operator & Function Chain

This is where pipe operator |> becomes handy in R.

The pipe operator does the “and then” job, and the chain can be written as:

x |>
  k() |>
  h() |>
  g() |>
  f() # Do not put pipe at the end!

Tip

Style guide: use |> instead of %>%. Use shortcut Cmd (Ctrl) + Shift + M.

Sometimes you’ll see the %>% operator instead, which comes from the external magrittr package, while |> is native to R (version 4.1+). To use %>%, load it with library(magrittr) (dplyr and the tidyverse also re-export it).
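A concrete example: the nested call sum(sqrt(c(16, 25))) must be read inside-out, while the piped version reads in execution order:

```r
# nested form: read from the innermost call outward
sum(sqrt(c(16, 25)))
# [1] 9

# piped form: read top to bottom, in the order of operations
c(16, 25) |>
  sqrt() |>
  sum()
# [1] 9
```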

Exercise (challenge!)

Solve \(\sqrt{2^3}\) using pipe operator.

  1. First, solve it the procedural way
  • For the square root, use the sqrt() function
  2. Next, solve it using the pipe operator.
  • Define a function named cube that computes x^3
  • The code should start with 2.

External packages

Packages are add-on libraries that extend the functionality of R.

  • They provide additional functions, datasets, and tools for various tasks
  • Can be easily installed with install.packages()
  • And loaded in R session with library()

Installing packages:

# Installs tidyverse package if not installed
# Once installed, you don't need to install again
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

Load packages: you need to load packages to use its functionality.

  • Need to load only once per session
library(tidyverse)

Control Structure

Control Structure?

Control structure dictates which code gets executed and when.

  1. Conditional Statements:
    • if statements: Execute code if a condition is true.
    • else/else if statements: Execute alternative code if the condition is false.
  2. Loops:
    • for loops: Repeat code block a specified number of times.
    • while loops: Continue executing code as long as a condition is true.
  3. Map (apply):
    • Map a function to each element of a collection without explicitly writing loops.
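for loops are covered in detail below; as a minimal sketch of the while form (the balance and rate values here are illustrative):

```r
# while repeats its body as long as the condition is TRUE
balance <- 100
years <- 0
while (balance < 200) {
  balance <- balance * 1.07 # grow at 7% per year
  years <- years + 1
}
print(years) # years needed to double at 7%
# [1] 11
```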

If-else

The basic form of if and if-else statement in R:

# Full form

if (condition) {
  true_action
} else {
  false_action
}

Example 1: if and else executes code based on logical conditions.

stock_price <- 115
if (stock_price > 110) {
  print("The stock price has increased significantly!")
}
[1] "The stock price has increased significantly!"

Example 2: If condition is not met, then nothing happens (skipped).

stock_price <- 100
if (stock_price > 105) {
  print("The stock price has increased!")
}

Example 3: else if checks one more logic condition:

stock_price <- 115
if (stock_price > 120) {
  print("The stock price has surged!")
} else if (stock_price > 110) {
  print("The stock price has increased moderately.")
}
[1] "The stock price has increased moderately."

Example 4: There can be multiple else if

stock_price <- 108
if (stock_price > 120) {
  print("The stock price has surged!")
} else if (stock_price > 110) {
  print("The stock price has increased moderately.")
} else if (stock_price > 105) {
  print("The stock price has increased slightly.")
}
[1] "The stock price has increased slightly."

Example 5: else is executed when all of if conditions are not met.

stock_price <- 100
if (stock_price > 120) {
  print("The stock price has surged!")
} else if (stock_price > 110) {
  print("The stock price has increased moderately.")
} else if (stock_price > 100) {
  print("The stock price has increased slightly.")
} else {
  print("The stock price has decreased.")
}
[1] "The stock price has decreased."

Logical Test Operators

  • Less than (< or >)
  • Less than or equal (<= or >=)
  • Equality (==) and Inequality (!=)
  • Logical NOT (!)
  • AND (&), OR (|)
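A few illustrative tests (the stock_price value is hypothetical):

```r
stock_price <- 108
stock_price > 110                      # FALSE
stock_price <= 110                     # TRUE
stock_price == 108                     # TRUE
stock_price != 108                     # FALSE
!(stock_price > 110)                   # TRUE (logical NOT)
stock_price > 100 & stock_price < 110  # TRUE (both must hold)
stock_price > 120 | stock_price < 110  # TRUE (at least one holds)
```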

Exercise

Write an if-else statement:

  • If PMT > 1000, add 10000 to PMT (i.e., PMT <- PMT + 10000)
  • Else if PMT > 500, add 100 to PMT
  • Else, set PMT = 0

What is the outcome of above if-else, if initial PMT was 750?

For loops

For loops are used when code has to be iterated a specified number of times.

# Syntax
for (iterator in iterable) {
  body
}
for (i in 1:5) {
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

If the for loop were written out explicitly:

# iterator i 
i <- 1
print(i)
[1] 1
i <- 2
print(i)
[1] 2
i <- 3
print(i)
[1] 3
i <- 4
print(i)
[1] 4
i <- 5
print(i)
[1] 5

To use a numeric index for each element, use seq_along() in the loop.

x <- c('a', 'b', 'c', 'd')
seq_along(x) # generate an integer sequence along the vector x
[1] 1 2 3 4
for (a in seq_along(x)) {
  print(x[a])
}
[1] "a"
[1] "b"
[1] "c"
[1] "d"

x itself can be an iterable:

for (el in x) {
  # direct elements
  print(el)
}
[1] "a"
[1] "b"
[1] "c"
[1] "d"

next and break

Generally used with if-else condition tests inside loop.

next is used to skip an iteration of the loop.

for (i in 1:5) {
  if (i == 3) {
    # skip if i == 3
    next
  }
  print(i)
}
[1] 1
[1] 2
[1] 4
[1] 5

break is used to exit the loop immediately.

for (i in 1:5) {
  if (i == 4) {
    break
  }
  print(i)
}
[1] 1
[1] 2
[1] 3

For loop: Compound interest

How to calculate compound interest over multiple years using a for loop?

  • Principal: $10,000
  • Interest rate: 5%
  • Number of years: 10
principal <- 10000
rate <- 0.05
num_periods <- 10
# generate placeholder vector for the cashflow
cashflow <- vector("numeric", length = num_periods)

for (i in 1:num_periods) {
  value <- principal * (1 + rate)^i
  cashflow[[i]] <- value
}
print(cashflow)
 [1] 10500.00 11025.00 11576.25 12155.06 12762.82 13400.96 14071.00 14774.55
 [9] 15513.28 16288.95

Exercise

Based on the previous example, do the following:

  • Q1. Skip the first year using if and next

    • Cashflow should have zero printed on the first slot
  • Q2. Stop the calculation if value exceeds $14,000

    • Cashflow should have zero printed on slots that exceed value of $14,000

Exercise 2

Write a function that checks the class of an input.

If the input is numeric, print “Numeric input!”; otherwise, print “Not numeric!”

  • use inherits(x, "numeric") for the logical test.
# example outcome
my_function(c(1, 4, 5))
[1] "Numeric input!"
my_function(c('a', 'b', 'c'))
[1] "Not numeric!"

Function mapping

map or apply is an implicit function loop.

  • a function f is an input argument for map()
  • map() requires the tidyverse or purrr package
  • More succinct and easier to read than for loops
  • Easy to use on data frames

Example: map()

  • The output is always a list
plus_one <- function(x) {
  return(x + 1)
}
# need to import tidyverse to use map()
library(tidyverse)

map(1:3, plus_one) # 1 2 3 is input for plus_one()
[[1]]
[1] 2

[[2]]
[1] 3

[[3]]
[1] 4

map function 2

If the desired output is not a list but an atomic vector:

  • map_dbl() a numeric (double) vector
  • map_chr() a character vector
  • map_lgl() a logical vector
  • map_int() an integer vector
map(1:3, plus_one) |> class()
[1] "list"
map_dbl(1:3, plus_one) |> class()
[1] "numeric"

Exercise

  1. Generate times_two() function that multiplies input by 2.

  2. map times_two function over 1:10

  3. Achieve same result with anonymous function instead.

  4. Achieve same result with for loop.

Lab Exercise 1

Convert the for loop operation below to map()

multiply_two <- \(x) x * 2
N <- 10
temp_list <- vector("list", N)
for (i in 1:N) {
  temp_list[[i]] <- multiply_two(i)
}

Financial Data Manipulation

data.frame class

  • One of the most important data classes in R, built on top of the list type

  • Stores data structure in 2D tabular form:

    • with rows (observations, or records)

    • and columns (variables)

  • Columns can have different types!

# example: diamonds dataset from ggplot2
library(tidyverse) 
data(diamonds) # load the data into the current environment
diamonds |> head(3) # show first 3 rows / same: head(diamonds, 3)
# A tibble: 3 × 10
  carat cut     color clarity depth table price     x     y     z
  <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good    E     VS1      56.9    65   327  4.05  4.07  2.31

Creating a data.frame

  • Creating a data.frame is similar to creating a list
# creating data.frame is similar to creating a list
my_dataframe <- data.frame(
  a = c(1,2,3),
  b = c('a','b','c'),
  c = c(TRUE, FALSE, FALSE)
)
print(my_dataframe)
  a b     c
1 1 a  TRUE
2 2 b FALSE
3 3 c FALSE

Exercise

Create a dataframe named housing:

  • 6 columns: Name, Age, Sex, Income, Housing, Zipcode
    • Name: Amy, Bill, Charles, Donna, Eckert
    • Age: 21, 25, 30, 38, 49
    • Sex: Female, Male, Male, Female, Male
    • Income: 36000, 53000, 89000, 82000, 166000
    • Housing: “Rent”, “Rent”, “Own”, “Own”, “Rent”
    • Zipcode: 12333, 12543, 11255, 12333, 33533

What is the type (class) of each column as automatically recognized by R?

  • Check with str(housing).

Q: What should be their type (class) in theory?

Modern R Syntax

dplyr / tidyverse

  • A modern approach that revolutionized R
  • Very fast, with key parts written in C++
  • Verb-based and easy to read
  • Developed by RStudio team (Posit)

dplyr verbs

Key verbs

  • Verbs for the core functionality
  • These verbs are used with the pipe operator |> or %>%
  1. select() : select subset of columns
    • rename() : rename columns
    • relocate() : change column positions
  2. filter() : select subset of rows with condition
  3. arrange() : reorder rows
  4. mutate() : add new columns (variables)
  5. summarize() : generate summary table based on group_by()
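These verbs compose naturally through the pipe; a small sketch chaining several of them on the diamonds data (assuming the tidyverse is loaded; the price_per_carat column is just an illustration):

```r
library(tidyverse)

diamonds |>
  filter(cut == "Ideal") |>                  # keep Ideal-cut rows
  select(carat, cut, price) |>               # keep three columns
  mutate(price_per_carat = price / carat) |> # add a derived column
  arrange(desc(price_per_carat)) |>          # sort descending
  head(3)                                    # show the top 3 rows
```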

dplyr::select()

Example 1:

iris |> 
  select(Sepal.Length, Petal.Width, Species) |> 
  head(3)
  Sepal.Length Petal.Width Species
1          5.1         0.2  setosa
2          4.9         0.2  setosa
3          4.7         0.2  setosa

Example 2:

Select from “Sepal.Length” (1st) to “Petal.Length” (3rd) column

iris |> select(Sepal.Length:Petal.Length) |> head(3)
  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4
2          4.9         3.0          1.4
3          4.7         3.2          1.3
iris |> select(1:3) |> head(3) # the same
  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4
2          4.9         3.0          1.4
3          4.7         3.2          1.3

Example 3:

Select columns except for specified columns

iris |> 
  select(!(Sepal.Length:Petal.Length)) |> 
  head(3)
  Petal.Width Species
1         0.2  setosa
2         0.2  setosa
3         0.2  setosa

Convenience functions

Convenience functions are allowed within select().

Examples:

  • starts_with(), ends_with(), contains()

  • …and lots more!

  • matches(), num_range(), all_of(), any_of(), everything(), last_col(), where()

  • Check tidyselect documentation

iris |> 
  select(starts_with("Sepal")) |>
  head(3)
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
iris |> 
  select(ends_with("Length")) |> 
  head(3)
  Sepal.Length Petal.Length
1          5.1          1.4
2          4.9          1.4
3          4.7          1.3
iris |> 
  select(contains("al")) |> 
  head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2

dplyr::rename()

iris |> 
  rename(sepal_len = Sepal.Length) |>  # change variable name
  head(3)
  sepal_len Sepal.Width Petal.Length Petal.Width Species
1       5.1         3.5          1.4         0.2  setosa
2       4.9         3.0          1.4         0.2  setosa
3       4.7         3.2          1.3         0.2  setosa

dplyr::relocate()

Rearrange the column order.

Unmentioned columns retain their original order.

iris |> 
  relocate(Species, ends_with('Length')) |> 
  head(3)
  Species Sepal.Length Petal.Length Sepal.Width Petal.Width
1  setosa          5.1          1.4         3.5         0.2
2  setosa          4.9          1.4         3.0         0.2
3  setosa          4.7          1.3         3.2         0.2

dplyr::filter()

Example 1: Subset rows with conditions

diamonds |> 
  filter(carat > 0.23) |> 
  head(3)
# A tibble: 3 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
2  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
3  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Example 2: Multiple conditions

diamonds |> 
  filter(carat < 0.28 & carat > 0.23, # use & for AND
         cut == 'Good') |> # a comma between conditions also means AND
  head(3)
# A tibble: 3 × 10
  carat cut   color clarity depth table price     x     y     z
  <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.26 Good  D     VS2      65.2    56   403  3.99  4.02  2.61
2  0.26 Good  D     VS1      58.4    63   403  4.19  4.24  2.46
3  0.26 Good  E     VVS1     57.9    60   554  4.22  4.25  2.45
  • c.f.) If you want to filter rows by row number, use slice()
diamonds |> 
  slice(1,3) # slice first and 3rd row
# A tibble: 2 × 10
  carat cut   color clarity depth table price     x     y     z
  <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31

Class Exercise

From diamonds dataframe (load tidyverse for access),

  1. Subset the dataframe with below conditions:

    1. carat is equal to 0.26 and;

    2. clarity is “VS2” and;

    3. select columns “carat”, “cut”, “clarity”

    4. then store it as sub_diamonds

  2. What is the dimension of sub_diamonds? Check with dim(sub_diamonds)

dplyr::arrange()

Arrange, or sort, the dataframe based on the specified column’s values

iris |> 
  arrange(Sepal.Length) |> # ascending by default
  head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          4.3         3.0          1.1         0.1  setosa
2          4.4         2.9          1.4         0.2  setosa
3          4.4         3.0          1.3         0.2  setosa
  • Can be arranged in descending order
iris |> 
  arrange(-Sepal.Length) |> # or, arrange(desc(Sepal.Length)) 
  head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          7.9         3.8          6.4         2.0 virginica
2          7.7         3.8          6.7         2.2 virginica
3          7.7         2.6          6.9         2.3 virginica

Arranging with multiple variables: hierarchical ordering

iris |> 
  arrange(Sepal.Length, Sepal.Width) |> 
  head(5)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          4.3         3.0          1.1         0.1  setosa
2          4.4         2.9          1.4         0.2  setosa
3          4.4         3.0          1.3         0.2  setosa
4          4.4         3.2          1.3         0.2  setosa
5          4.5         2.3          1.3         0.3  setosa
Caution

It is incorrect to chain arrange calls:

iris |> arrange(Sepal.Length) |> arrange(Sepal.Width)

because the second call resets the ordering of the first.

dplyr::mutate()

Compute transformation of variables and create new column.

iris |> 
  mutate(new_length = Sepal.Length + Petal.Length) |> 
  head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_length
1          5.1         3.5          1.4         0.2  setosa        6.5
2          4.9         3.0          1.4         0.2  setosa        6.3
3          4.7         3.2          1.3         0.2  setosa        6.0

Example: Z-score standardizing

\[ \frac{X - \bar{X}}{\sigma_X} = \frac{X - \text{mean}(X)}{\text{sd}(X)} \]

iris |> 
  mutate(
    std_sepal_length = 
      (Sepal.Length - mean(Sepal.Length))/ sd(Sepal.Length)) |> 
  select(Sepal.Length, std_sepal_length) |> 
  head(3)
  Sepal.Length std_sepal_length
1          5.1       -0.8976739
2          4.9       -1.1392005
3          4.7       -1.3807271

dplyr::summarize()

Many times we are interested in getting summary statistics for groups.

  • Summarizing is also called data aggregation

  • Often used with group_by() , to generate summary


Example: What is the average of “Sepal.Length” by “Species” in iris dataset?

# Let's browse iris data
iris |> 
  group_by(Species) |> 
  slice(1:3) # show the first 3 rows from each Species
# A tibble: 9 × 5
# Groups:   Species [3]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
         <dbl>       <dbl>        <dbl>       <dbl> <fct>     
1          5.1         3.5          1.4         0.2 setosa    
2          4.9         3            1.4         0.2 setosa    
3          4.7         3.2          1.3         0.2 setosa    
4          7           3.2          4.7         1.4 versicolor
5          6.4         3.2          4.5         1.5 versicolor
6          6.9         3.1          4.9         1.5 versicolor
7          6.3         3.3          6           2.5 virginica 
8          5.8         2.7          5.1         1.9 virginica 
9          7.1         3            5.9         2.1 virginica 

Above code shows the first 3 rows for each Species.


Example: What is the average of Sepal.Length by each group?

iris |> 
  group_by(Species) |> 
  summarize(avg_sepal_length = mean(Sepal.Length)) 
# A tibble: 3 × 2
  Species    avg_sepal_length
  <fct>                 <dbl>
1 setosa                 5.01
2 versicolor             5.94
3 virginica              6.59

What if we want to summarize across every column?

  • Example: Calculate average of across all columns by each group
iris |> 
  group_by(Species) |> 
  summarize(across(everything(), mean))
# A tibble: 3 × 5
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa             5.01        3.43         1.46       0.246
2 versicolor         5.94        2.77         4.26       1.33 
3 virginica          6.59        2.97         5.55       2.03 
# Other example
iris |> 
  group_by(Species) |> 
  summarize(across(ends_with("Length"), mean))
# A tibble: 3 × 3
  Species    Sepal.Length Petal.Length
  <fct>             <dbl>        <dbl>
1 setosa             5.01         1.46
2 versicolor         5.94         4.26
3 virginica          6.59         5.55

Lab Exercise

From diamonds dataframe:

  1. How many observations (rows) have carat value greater than 3.1?
  2. What is the average price of diamonds whose carat equals 1.0?
    • Use filter() and summarize()
  3. What are the minimum, average, and maximum price by cut?
    • Use min() and max()
  4. How many observations are found by each cut?
    • Use n()
  5. Arrange the dataframe by carat (descending) and then price (ascending).
  6. Mutate a new column, named xyz, which is the multiplication of x, y, and z. Store the dataframe as my_diamond.
  7. What is the maximum value of xyz (max(my_diamond$xyz))?
  8. What is the correlation between carat and price by each cut?
    • Use cor(x,y) for correlation between x and y

map functions on dataframes

Advanced example

Run repeated regressions and extract coefficients for each group.

  • Useful for CAPM beta estimations
iris |>
  group_by(Species) |>
  nest() |>
  mutate(model = map(data, \(x) lm(Sepal.Length ~ Sepal.Width, x)),
         intercept = map_dbl(model, \(x) x$coefficients[[1]]),
         slope = map_dbl(model, \(x) x$coefficients[[2]])) |> 
  select(Species, intercept, slope)
# A tibble: 3 × 3
# Groups:   Species [3]
  Species    intercept slope
  <fct>          <dbl> <dbl>
1 setosa          2.64 0.690
2 versicolor      3.54 0.865
3 virginica       3.91 0.902

Lab Exercise

Summarize Financial data

  1. Install tidyquant package
install.packages('tidyquant')
  2. Download end-of-day stock price data using the code below
library(tidyquant)
tickers = c("MSFT","TSLA", "AAPL", "BAC")
stock_prices = tq_get(tickers, from = '2020-01-01', to = '2023-12-31')
  3. What is the average daily trading volume for each company (symbol)? You should get a result like below:
symbol mean(volume)
AAPL 98953731
BAC 52857474
MSFT 30654553
TSLA 133226380
  4. What is the average trading volume of each company during the peak COVID-19 season (March 2020)? You should get:
  • Hint: use filter() with year() and month() applied to the date variable
symbol mean(volume)
AAPL 285457836
BAC 128477227
MSFT 73304341
TSLA 287001136

Data Tidying

Tidy data

Real-world data rarely comes clean and ready for your analysis.

You will learn a consistent data structure called tidy data.

Data examples

The same data can be presented in various ways. The example data contains four pieces of information:

  • country
  • year
  • population
  • Number of TB (tuberculosis) cases

The data (table1, table2, table3) are available when you load tidyverse

Example 1

country year cases population
Afghanistan 1999 745 19987071
Afghanistan 2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583

Example 2

country year type count
Afghanistan 1999 cases 745
Afghanistan 1999 population 19987071
Afghanistan 2000 cases 2666
Afghanistan 2000 population 20595360
Brazil 1999 cases 37737
Brazil 1999 population 172006362
Brazil 2000 cases 80488
Brazil 2000 population 174504898
China 1999 cases 212258
China 1999 population 1272915272
China 2000 cases 213766
China 2000 population 1280428583

Example 3

country year rate
Afghanistan 1999 745/19987071
Afghanistan 2000 2666/20595360
Brazil 1999 37737/172006362
Brazil 2000 80488/174504898
China 1999 212258/1272915272
China 2000 213766/1280428583

Tidy data

Which data structure is easier to work with for general purposes?

Tidy data

A definition of tidy data:

  1. Each variable is a column;
  2. Each observation is a row;
  3. Each value is a cell; each cell is a single value.

Tidy data

When data is tidy, it is generally easier to work with.

  • However, sometimes you’ll need to pivot (reshape) the data for your analysis.

Example

If you want to compute the rate of TB cases per 10,000 population from table1:

table1 |> 
  mutate(rate = cases / population * 10000) |> 
  print()
# A tibble: 6 × 5
  country      year  cases population  rate
  <chr>       <dbl>  <dbl>      <dbl> <dbl>
1 Afghanistan  1999    745   19987071 0.373
2 Afghanistan  2000   2666   20595360 1.29 
3 Brazil       1999  37737  172006362 2.19 
4 Brazil       2000  80488  174504898 4.61 
5 China        1999 212258 1272915272 1.67 
6 China        2000 213766 1280428583 1.67 

The same operation is not straightforward with the table2 layout.

table2 # where should rate variable be located??
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Quick exercise

Load package tidyverse.

table1 will be ready for access.

  1. What is the average population of each country? (Hint: group_by() and summarize())
  2. What is the total number of TB cases across all years for each country?
  3. What is the average percentage ratio of TB cases over population by each country?

Pivot data

For your analysis, you will need to pivot the data (aka reshape):

  • to longer form (fewer variables and more observations)

  • or to wider form (more columns and fewer rows)

Tip

Whenever you pivot data, think about which columns are affected and where the column names and values should go.

Billboard example

Let’s take a look at billboard dataset which is in wide form.

Pivot longer

Each observation (row) is a song, and 76 columns (wk1 to wk76) record its weekly rank.

To tidy up, we want a single rank column that stores those numbers.

How can we pivot the data so that we have rank in one column?

A simple illustration of how pivot_longer() works:

Pivot Billboard data

artist track date.entered week rank
2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
2 Pac Baby Don't Cry (Keep... 2000-02-26 wk8 NA
2 Pac Baby Don't Cry (Keep... 2000-02-26 wk9 NA
2 Pac Baby Don't Cry (Keep... 2000-02-26 wk10 NA
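The long table above can be produced with pivot_longer(); a sketch, assuming tidyr (loaded with tidyverse) is attached:

```r
library(tidyr)

billboard_longer <- billboard |> 
  pivot_longer(
    cols = starts_with("wk"),  # the wk1 ... wk76 columns
    names_to = "week",         # old column names go into "week"
    values_to = "rank"         # cell values go into "rank"
  )
head(billboard_longer)
```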

Exercise

  1. Load billboard data with
data(billboard)
print(billboard)
  2. Select artist, track, and the columns “wk1” to “wk12”
  • Browse ?num_range
  • Use num_range() in select()
  3. Pivot the data to longer form.

Pivot data: wider

pivot_wider() works in the opposite way:

  • increases the number of columns (variables)

  • decreases the number of rows (observations)

Wider forms are common for machine learning.

  • One hot encoding / dummy variables

Example

Previous billboard_longer data:

head(billboard_longer, 5)
# A tibble: 5 × 5
  artist track                   date.entered week   rank
  <chr>  <chr>                   <date>       <chr> <dbl>
1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87
2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82
3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72
4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77
5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87

Pivot the data to wider form:

billboard_longer |> 
  pivot_wider(
    names_from = week, # must be provided
    values_from = rank # must be provided
  ) |> 
  head(5)
# A tibble: 5 × 79
  artist      track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
  <chr>       <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac       Baby… 2000-02-26      87    82    72    77    87    94    99    NA
2 2Ge+her     The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
3 3 Doors Do… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
4 3 Doors Do… Loser 2000-10-21      76    76    72    69    67    65    55    59
5 504 Boyz    Wobb… 2000-04-15      57    34    25    17    17    31    36    49
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
#   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …

Note that pivot_wider() requires at least two inputs: names_from and values_from.

Financial Data Example

Stock price data commonly comes in the form below:

symbol date open high low close volume adjusted
TSLA 2023-01-03 118.47 118.80 104.64 108.10 231402800 108.10
TSLA 2023-01-04 109.11 114.59 107.52 113.64 180389000 113.64
TSLA 2023-01-05 110.51 111.75 107.16 110.34 157986300 110.34
TSLA 2023-01-06 103.00 114.39 101.81 113.06 220911100 113.06
TSLA 2023-01-09 118.96 123.52 117.11 119.77 190284000 119.77
TSLA 2023-01-10 121.07 122.76 114.92 118.85 167642500 118.85

Question: What if you wanted to have a column for each stock’s adjusted price, like below?

date TSLA BAC XOM
2023-01-03 108.10 30.97426 96.07031
2023-01-04 113.64 31.55660 96.34993
2023-01-05 110.34 31.49189 98.50565
2023-01-06 113.06 31.80616 99.69629
2023-01-09 119.77 31.32551 97.83820
2023-01-10 118.85 31.53811 99.29942
stock_prices |> 
  select(symbol, date, adjusted) |> 
  pivot_wider(
    names_from = symbol,
    values_from = adjusted)
# A tibble: 250 × 4
   date        TSLA   BAC   XOM
   <date>     <dbl> <dbl> <dbl>
 1 2023-01-03  108.  31.0  96.1
 2 2023-01-04  114.  31.6  96.3
 3 2023-01-05  110.  31.5  98.5
 4 2023-01-06  113.  31.8  99.7
 5 2023-01-09  120.  31.3  97.8
 6 2023-01-10  119.  31.5  99.3
 7 2023-01-11  123.  31.8 100. 
 8 2023-01-12  124.  31.9 102. 
 9 2023-01-13  122.  32.6 102. 
10 2023-01-17  131.  31.9 102. 
# ℹ 240 more rows

Class Exercise

  1. Examine the data structure of table1 and table2, available from tidyverse.

How can you transform table1 to table2?

Answer

table1 |> 
  pivot_longer(
    cols = c(cases, population), # multiple selection: use c()
    names_to = 'type',
    values_to = 'count'
  )
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Lab problem

Examine the population data available from tidyverse.

Pivot the data to generate the output below.

(Hint: Browse the documentation with ?pivot_wider and see the names_prefix argument.)

country year_1995 year_1996 year_1997 year_1998 year_1999 year_2000 year_2001 year_2002 year_2003 year_2004 year_2005 year_2006 year_2007 year_2008 year_2009 year_2010 year_2011 year_2012 year_2013
Afghanistan 17586073 18415307 19021226 19496836 19987071 20595360 21347782 22202806 23116142 24018682 24860855 25631282 26349243 27032197 27708187 28397812 29105480 29824536 30551674
Albania 3357858 3341043 3331317 3325456 3317941 3304948 3286084 3263596 3239385 3216197 3196130 3179573 3166222 3156608 3151185 3150143 3153883 3162083 3173271
Algeria 29315463 29845208 30345466 30820435 31276295 31719449 32150198 32572977 33003442 33461345 33960903 34507214 35097043 35725377 36383302 37062820 37762962 38481705 39208194
American Samoa 52874 53926 54942 55899 56768 57522 58176 58729 59117 59262 59117 58652 57919 57053 56245 55636 55274 55128 55165
Andorra 63854 64274 64090 63799 64084 65399 68000 71639 75643 79060 81223 81877 81292 79969 78659 77907 77865 78360 79218
Angola 12104952 12451945 12791388 13137542 13510616 13924930 14385283 14886574 15421075 15976715 16544376 17122409 17712824 18314441 18926650 19549124 20180490 20820525 21471618
Anguilla 9807 10063 10305 10545 10797 11071 11371 11693 12023 12342 12637 12903 13145 13365 13571 13768 13956 14132 14300
Antigua and Barbuda 68349 70245 72232 74206 76041 77648 78972 80030 80904 81718 82565 83467 84397 85349 86300 87233 88152 89069 89985
Argentina 34833168 35264070 35690778 36109342 36514558 36903067 37273361 37627545 37970411 38308779 38647854 38988923 39331357 39676083 40023641 40374224 40728738 41086927 41446246
Armenia 3223173 3173425 3137652 3112958 3093820 3076098 3059960 3047002 3036032 3025652 3014917 3002911 2989882 2977488 2968154 2963496 2964120 2969081 2976566

Financial Data

Data in Finance

Financial data encompasses information related to financial markets, instruments, and economic indicators. It includes:

  • Stock prices
  • Bonds and interest rates
  • Commodities prices
  • Real estate prices
  • Exchange rates
  • Financial statements

Types of Financial Data

  1. Time Series Data: Sequential data points over time, e.g., daily stock prices.
  2. Cross-sectional Data: Data at a single point in time across many entities, e.g., balance sheets of various companies.
  3. Panel Data: Combines time series and cross-sectional data, e.g., yearly revenue of several companies over a decade.

Financial Data Sources 1

  • Exchanges: NYSE, NASDAQ
  • Government Publications: FRED, U.S. Treasury, Economic reports
  • Financial News Outlets: Bloomberg, Reuters
  • Data Providers: Yahoo Finance, Quandl (NASDAQ Data Link)

Financial Data Sources 2

  • SEC EDGAR: U.S. corporate filings, public companies’ financials
  • World Bank: Global economic indicators, international trade
  • U.S. Census Bureau: Demographic information and social indicators
  • Macroeconomic Indicators:
    • Bureau of Economic Analysis (BEA): For U.S. economic accounts, including GDP and personal income.
    • Federal Reserve Economic Data (FRED): A comprehensive database of U.S. financial and economic data.
    • International Monetary Fund (IMF): For global financial stability reports, world economic outlooks, and international financial statistics.

Recap on Portfolio theory

Risk and Return

The risk-return tradeoff is a fundamental concept in finance

  • Higher Risk: Greater potential for return but higher chance of loss
  • Lower Risk: More predictable outcomes, but typically lower returns

Calculating Returns

Return Calculation for Individual Assets:

The holding period return of an asset (\(r\)) over a period is calculated using the formula:

\[ r = \frac{P_{end} - P_{begin} + D}{P_{begin}} \]

  • \(r\) is the rate of return for the holding period

  • \(P_{end}\) is the ending price of the asset

  • \(P_{begin}\) is the beginning price of the asset

  • \(D\) represents any dividends or income received during the period
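A quick base-R check of the formula, with made-up prices and a made-up dividend:

```r
# Holding period return: buy at 100, sell at 110, receive 2 in dividends
p_begin <- 100
p_end   <- 110
d       <- 2
r <- (p_end - p_begin + d) / p_begin
r  # 0.12, i.e. a 12% holding period return
```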

Calculating Returns

Tip
  • We don’t need to account for \(D\) when returns are calculated from adjusted close prices
  • Typically the holding period is set at a fixed frequency, such as minutely, hourly, daily, weekly, or monthly.

Portfolio Return

The return on a portfolio is a weighted sum of the individual returns of the assets within the portfolio.

  • Portfolio Return Formula:

The return of a portfolio (\(r_p\)) is calculated as:

\[ r_p = \sum_{i=1}^{n} w_i r_i \]

  • \(r_p\) is the return of the portfolio over the period

  • \(w_i\) is the weight of asset \(i\) in the portfolio

  • \(r_i\) is the return of asset \(i\) over the period

  • \(n\) is the number of assets in the portfolio
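A base-R check of the weighted-sum formula (the weights and returns below are illustrative, not the exercise values):

```r
# Portfolio return as a weighted sum of asset returns
w <- c(0.5, 0.5)    # weights on assets A and B
r <- c(0.06, 0.10)  # period returns of A and B
r_p <- sum(w * r)
r_p  # 0.08, i.e. an 8% portfolio return
```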

Analytical Exercise

Suppose you have a portfolio consisting of two assets, A and B. The expected annual return for asset A is 8%, and for asset B, it’s 12%.

If 60% of your portfolio is invested in asset A and 40% in asset B,

Calculate the expected annual return of the portfolio.

Stand-Alone Risk

Stand-alone risk considers the risk of a single asset independently

  • Measured by the standard deviation of historical period returns

The formula for risk (\(\sigma\)) is:

\[ \sigma = \sqrt{\frac{\sum (r_t - \bar{r})^2}{T - 1}} = sd(r) \]

Where:

  • \(r_t\) are the returns in period \(t\)
  • \(\bar{r}\) is the average return
  • \(T\) is the number of periods
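The formula is exactly what R's sd() computes; a base-R check on made-up period returns:

```r
# Stand-alone risk: manual formula vs built-in sd()
r <- c(0.02, -0.01, 0.03, 0.00, -0.02)  # made-up period returns
manual <- sqrt(sum((r - mean(r))^2) / (length(r) - 1))
all.equal(manual, sd(r))  # TRUE: sd() uses the T - 1 denominator
```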

Portfolio Risk

Portfolio risk involves the risk associated with holding a portfolio of assets.

  • Not simply the weighted average of the stand-alone risks of portfolio assets
  • Influenced by correlation between returns of assets

Analytical Exercise

Suppose you have a portfolio consisting of two assets, A and B. The expected annual return for asset A is 8%, and for asset B, it’s 12%.

The annual standard deviation of returns for asset A is 10% (0.1), and for asset B, it’s 15% (0.15). The correlation coefficient between the returns of assets A and B is 0.5.

If 60% of your portfolio is invested in asset A and 40% in asset B,

Calculate the expected annual risk of the portfolio.

\[ \sigma_p^2 = w_A^2 \sigma_A^2 + w_B^2 \sigma_B^2 + 2 w_A w_B \sigma_A \sigma_B \rho_{AB} \]
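A base-R evaluation of the two-asset variance formula (the weights, volatilities, and correlation below are illustrative, not the exercise values):

```r
# Two-asset portfolio variance and risk
w_a <- 0.5;  w_b <- 0.5     # weights
s_a <- 0.10; s_b <- 0.20    # standard deviations
rho <- 0.3                  # correlation between A and B
var_p <- w_a^2 * s_a^2 + w_b^2 * s_b^2 + 2 * w_a * w_b * s_a * s_b * rho
sd_p  <- sqrt(var_p)
c(variance = var_p, risk = sd_p)
```

Note that sd_p comes out below the weighted average of the individual risks (0.15 here), because the assets are imperfectly correlated.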

Portfolio Risk

Empirical approach

At each time \(t\), calculate the return of a portfolio (\(r_p\))

\[ r_{p,t} = \sum_{i=1}^{n} w_{i,t} r_{i,t} \]

Then calculate standard deviation of portfolio returns (\(\sigma_p\)) over \(t\)

\[ \sigma_p = sd(r_p) = \sqrt{\frac{\sum (r_{p,t} - \bar{r_p})^2}{T - 1}} \]

Walk-through Exercise

Step 1: Get stock prices

library(tidyverse)
library(tidyquant)
stock_prices <- tq_get(c('TSLA','BAC','XOM'), 
                      from = '2020-01-01', 
                      to = '2023-12-31')
stock_prices <- stock_prices |> 
    select(symbol, date, adjusted)
stock_prices |> head()
# A tibble: 6 × 3
  symbol date       adjusted
  <chr>  <date>        <dbl>
1 TSLA   2020-01-02     28.7
2 TSLA   2020-01-03     29.5
3 TSLA   2020-01-06     30.1
4 TSLA   2020-01-07     31.3
5 TSLA   2020-01-08     32.8
6 TSLA   2020-01-09     32.1

Step 2: Period returns

stock_returns <- stock_prices |> 
  arrange(symbol, date) |> 
  group_by(symbol) |> 
  mutate(
    daily_return = adjusted/lag(adjusted) - 1,
    ) |> 
  select(symbol, date, daily_return)

stock_returns <- stock_returns |> drop_na()

stock_returns |> head()
# A tibble: 6 × 3
# Groups:   symbol [1]
  symbol date       daily_return
  <chr>  <date>            <dbl>
1 BAC    2020-01-03     -0.0208 
2 BAC    2020-01-06     -0.00143
3 BAC    2020-01-07     -0.00660
4 BAC    2020-01-08      0.0101 
5 BAC    2020-01-09      0.00172
6 BAC    2020-01-10     -0.00828
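The daily_return column above is the simple one-period return adjusted/lag(adjusted) - 1; a base-R check with made-up prices:

```r
# Simple returns from a price series: p[t] / p[t-1] - 1
p <- c(100, 110, 99)
ret <- p[-1] / p[-length(p)] - 1
ret  # 0.10 (100 -> 110) and -0.10 (110 -> 99)
```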

Step 3: Pivot to Wide-form

We will pivot the data to wide form, deviating from the tidy layout.

returns_wide <- stock_returns  |> 
  pivot_wider(names_from = symbol, 
              values_from = daily_return)
returns_wide |> head()
# A tibble: 6 × 4
  date            BAC     TSLA      XOM
  <date>        <dbl>    <dbl>    <dbl>
1 2020-01-03 -0.0208   0.0296  -0.00804
2 2020-01-06 -0.00143  0.0193   0.00768
3 2020-01-07 -0.00660  0.0388  -0.00818
4 2020-01-08  0.0101   0.0492  -0.0151 
5 2020-01-09  0.00172 -0.0219   0.00766
6 2020-01-10 -0.00828 -0.00663 -0.00889

Drop rows with missing values

returns_wide <- returns_wide |> 
  drop_na()

Step 4: Generate portfolio return

# Generating portfolio returns and
# calculating standard deviation of historic returns
weights <- c(0.2,0.3,0.5)
returns_wide <- returns_wide |>
    rowwise() |> 
    mutate(port_ret = sum(c_across(!date) * weights)) |> 
    ungroup()
returns_wide |> head()
# A tibble: 6 × 5
  date            BAC     TSLA      XOM  port_ret
  <date>        <dbl>    <dbl>    <dbl>     <dbl>
1 2020-01-03 -0.0208   0.0296  -0.00804  0.000718
2 2020-01-06 -0.00143  0.0193   0.00768  0.00933 
3 2020-01-07 -0.00660  0.0388  -0.00818  0.00623 
4 2020-01-08  0.0101   0.0492  -0.0151   0.00924 
5 2020-01-09  0.00172 -0.0219   0.00766 -0.00241 
6 2020-01-10 -0.00828 -0.00663 -0.00889 -0.00809 
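The rowwise()/c_across() pattern is flexible but can be slow on large data; the same weighted sum can be computed with matrix multiplication. A self-contained sketch, using the first two rows of the rounded output above as inputs:

```r
# Portfolio returns via matrix multiplication (one value per row)
weights <- c(0.2, 0.3, 0.5)                      # BAC, TSLA, XOM
ret_mat <- rbind(c(-0.0208,  0.0296, -0.00804),  # 2020-01-03
                 c(-0.00143, 0.0193,  0.00768))  # 2020-01-06
port_ret <- as.numeric(ret_mat %*% weights)
port_ret  # approximately 0.0007 and 0.0093
```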

Step 5: Calculate Portfolio risk (sd)

portfolio_risk_m2 <- returns_wide |> 
    summarize(across(!date, sd))
portfolio_risk_m2
# A tibble: 1 × 4
     BAC   TSLA    XOM port_ret
   <dbl>  <dbl>  <dbl>    <dbl>
1 0.0241 0.0429 0.0235   0.0220

Recap on Capital Asset Pricing Model

CAPM

Derived from portfolio theory and optimization.

\[ E(r_i) = r_f + \beta_i (E(r_m) - r_f) \]

Where:

  • \(E(r_i)\) is the expected return of the investment

  • \(r_f\) is the risk-free rate

  • \(\beta_i\) is the beta of the investment

  • \(E(r_m)\) is the expected return of the market

  • \((E(r_m) - r_f)\) is known as the market risk premium

Beta

Beta is a measure of the sensitivity of an individual investment’s returns to the market.

  • An indication of an asset’s risk relative to systematic (market) risk

\[ \beta_i = \frac{Cov(r_i - r_f, r_m-r_f)}{Var(r_m-r_f)} \]

  • Same as the slope coefficient \(\beta_1\) from the simple linear regression below

\[ r_i - r_f = \beta_0 + \beta_1(r_m - r_f)+ e_i \]

Tip

In some specific cases (e.g., intraday frequency), beta is estimated with raw returns (\(r_i\) and \(r_m\)) instead of excess returns (\(r_i - r_f\))

Alpha

  • Alpha (\(\alpha\)) is a financial metric indicating the extra return compared to the return of a benchmark index
  • In CAPM, \(\alpha\) is the intercept of the regression
  • A positive \(\alpha\) indicates outperformance, while a negative one indicates underperformance
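Both alpha and beta can be read off a fitted lm() object. A self-contained sketch on simulated excess returns (the series below are simulated, not real market data):

```r
# Alpha (intercept) and beta (slope) from a CAPM-style regression
set.seed(1)
mkt_exret   <- rnorm(500, mean = 0.0003, sd = 0.01)  # simulated market
stock_exret <- 0.0005 + 1.2 * mkt_exret + rnorm(500, sd = 0.008)

fit   <- lm(stock_exret ~ mkt_exret)
alpha <- coef(fit)[[1]]   # intercept: abnormal return
beta  <- coef(fit)[[2]]   # slope: sensitivity to the market

# Beta also equals cov/var of the excess returns
beta_cov <- cov(stock_exret, mkt_exret) / var(mkt_exret)
all.equal(beta, beta_cov)  # TRUE
```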

Walk-through Exercise

Prepare stock returns and market (index) returns.

Then combine the two dataframes (bind them row-wise).

tickers <- c("MSFT")
stock_prices <- tq_get(tickers, from = '2020-01-01', to = '2023-12-31')
snp500 <- tq_get("SP500", get = "economic.data", from = "2020-01-01", to = "2023-12-31")
snp500 <- snp500 |> rename(adjusted = price) # rename column for binding
prices <- bind_rows(stock_prices, snp500)
prices |> glimpse()
Rows: 2,049
Columns: 8
$ symbol   <chr> "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT…
$ date     <date> 2020-01-02, 2020-01-03, 2020-01-06, 2020-01-07, 2020-01-08, …
$ open     <dbl> 158.78, 158.32, 157.08, 159.32, 158.93, 161.84, 162.82, 161.7…
$ high     <dbl> 160.73, 159.95, 159.10, 159.67, 160.80, 162.22, 163.22, 163.3…
$ low      <dbl> 158.33, 158.06, 156.51, 157.32, 157.95, 161.03, 161.18, 161.2…
$ close    <dbl> 160.62, 158.62, 159.03, 157.58, 160.09, 162.09, 161.34, 163.2…
$ volume   <dbl> 22622100, 21116200, 20813700, 21634100, 27746500, 21385000, 2…
$ adjusted <dbl> 152.5057, 150.6067, 150.9960, 149.6192, 152.0025, 153.9014, 1…

Pivot price data into wide form:

price_wide = prices |> 
  pivot_wider(names_from = symbol, 
              values_from = adjusted,
              id_cols = date)
price_wide |> head()
# A tibble: 6 × 3
  date        MSFT SP500
  <date>     <dbl> <dbl>
1 2020-01-02  153. 3258.
2 2020-01-03  151. 3235.
3 2020-01-06  151. 3246.
4 2020-01-07  150. 3237.
5 2020-01-08  152. 3253.
6 2020-01-09  154. 3275.

Generate stock and index returns

# Use of anonymous function to calculate return
returns_wide <- price_wide |> 
  mutate(across(-date, \(x) { (x / lag(x) ) - 1})) 

Prepare the risk-free rate (use the 3-month Treasury bill rate)

risk_free_data = tq_get("DGS3MO", 
                        get = "economic.data", 
                        from = "2020-01-01", 
                        to = "2023-12-31")
risk_free_data |> head(3)
# A tibble: 3 × 3
  symbol date       price
  <chr>  <date>     <dbl>
1 DGS3MO 2020-01-01 NA   
2 DGS3MO 2020-01-02  1.54
3 DGS3MO 2020-01-03  1.52

The risk-free rate is in percentage terms (%) and annualized.

Convert the rate to a daily level

  • Use 252 business days assumption
  • Use simple division for daily conversion \(r_d = r_y/252\)
risk_free_converted <- risk_free_data |> 
  mutate(rf = (price / 252) / 100)
risk_free_converted |> head()
# A tibble: 6 × 4
  symbol date       price         rf
  <chr>  <date>     <dbl>      <dbl>
1 DGS3MO 2020-01-01 NA    NA        
2 DGS3MO 2020-01-02  1.54  0.0000611
3 DGS3MO 2020-01-03  1.52  0.0000603
4 DGS3MO 2020-01-06  1.56  0.0000619
5 DGS3MO 2020-01-07  1.54  0.0000611
6 DGS3MO 2020-01-08  1.54  0.0000611

Convert to wide form:

risk_free_converted <- risk_free_converted |> 
  pivot_wider(id_cols= date,
              names_from = symbol,
              values_from = rf
              )
risk_free_converted |> head()
# A tibble: 6 × 2
  date           DGS3MO
  <date>          <dbl>
1 2020-01-01 NA        
2 2020-01-02  0.0000611
3 2020-01-03  0.0000603
4 2020-01-06  0.0000619
5 2020-01-07  0.0000611
6 2020-01-08  0.0000611

Join the two dataframes column-wise

capm_data <- returns_wide |> 
  left_join(risk_free_converted, by = "date")
capm_data |> head(3)
# A tibble: 3 × 4
  date           MSFT    SP500    DGS3MO
  <date>        <dbl>    <dbl>     <dbl>
1 2020-01-02 NA       NA       0.0000611
2 2020-01-03 -0.0125  -0.00706 0.0000603
3 2020-01-06  0.00258  0.00353 0.0000619

Then generate excess returns:

capm_data <- capm_data |> 
  mutate(
    MSFT_exret = MSFT - DGS3MO,
    Mkt_exret = SP500 - DGS3MO)

CAPM Estimation

Use lm() for linear regression fit.

capm_fit <- lm(MSFT_exret ~ Mkt_exret, capm_data)
print(capm_fit)

Call:
lm(formula = MSFT_exret ~ Mkt_exret, data = capm_data)

Coefficients:
(Intercept)    Mkt_exret  
  0.0005149    1.1718518  

To browse summary of regression:

summary(capm_fit)

Call:
lm(formula = MSFT_exret ~ Mkt_exret, data = capm_data)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.068986 -0.006335 -0.000456  0.006333  0.076456 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.0005149  0.0003655   1.409    0.159    
Mkt_exret   1.1718518  0.0251585  46.579   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01154 on 996 degrees of freedom
  (45 observations deleted due to missingness)
Multiple R-squared:  0.6854,    Adjusted R-squared:  0.6851 
F-statistic:  2170 on 1 and 996 DF,  p-value: < 2.2e-16

How to access and extract coefficient estimates?

# Remember that the model object is a "list"
capm_fit$coefficients
 (Intercept)    Mkt_exret 
0.0005148837 1.1718518320 
capm_fit$coefficients |> typeof()
[1] "double"

To extract beta estimate:

capm_fit$coefficients[[2]]
[1] 1.171852

Discussions

Timeframe

CAPM estimates (like any others) depend heavily on the timeframe you choose.

Consider following:

  • Estimate based on the most recent 1 month
  • Estimate based on the most recent 3 months
  • Estimate based on the most recent 12 months

Should they be similar? Usually not.

Exercise

Using our previous example, estimate the CAPM beta with the following:

  • Assume today is 2023-12-31.
  • Estimate based on the most recent 1 month of data
  • Estimate based on the most recent 3 months of data
  • Estimate based on the most recent 12 months of data

Rolling Windows

Given a set “lookback” period, the estimate should only use past information.

To capture timely information at each period, rolling regressions are often performed.
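A minimal base-R sketch of a rolling-window beta on simulated data (the window length and return series here are illustrative assumptions):

```r
# Rolling-window beta: at each date t, regress over the previous `window` days
set.seed(42)
n      <- 300
mkt    <- rnorm(n, 0, 0.01)               # simulated market returns
stock  <- 1.1 * mkt + rnorm(n, 0, 0.008)  # simulated stock returns
window <- 60                              # lookback period

rolling_beta <- rep(NA_real_, n)
for (t in window:n) {
  idx <- (t - window + 1):t               # only past information
  rolling_beta[t] <- coef(lm(stock[idx] ~ mkt[idx]))[[2]]
}
summary(rolling_beta[window:n])
```

In practice, vectorized helpers such as zoo::rollapply() or slider::slide() are commonly used for rolling computations.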

Lab Problems

Problem 1

Generate stock prices with below.

stock_prices = tq_get(c('AAPL','MSFT'), 
                      from = '2020-01-01', 
                      to = '2023-12-31')

Using adjusted daily closing prices, calculate the expected (average) return and risk of

  • an equal-weighted portfolio

  • 30% on Apple and 70% on Microsoft

  • 70% on Apple and 30% on Microsoft

Problem 2

Based on the portfolio return (70% on Apple and 30% on Microsoft), calculate the portfolio beta.

  • Use daily periodicity.
  • Use the 3-month Treasury rate as the risk-free rate.

Data I/O

Read and Write

  • Importing data is a crucial first step in any data science project.

  • We will learn how to read external data into R as a data.frame object.

  • We will also learn how to write an R data.frame to a local file.

CSV files

  • CSV: Comma-Separated Values

  • A plain, human-readable text data file

  • Minimalistic, widely used

  • Typically opened with Excel, but it is not an Excel file!

  • Since it is text, R tries to “guess” the type of each column when importing
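A base-R illustration of this guessing, using write.csv()/read.csv() from the built-in utils package:

```r
# Round-trip iris through a csv file and inspect the guessed column types
path <- tempfile(fileext = ".csv")
write.csv(iris, path, row.names = FALSE)
my_iris <- read.csv(path)
sapply(my_iris, class)
# The numeric columns come back as "numeric", but Species comes back as
# "character": the original factor type is lost in the plain-text format
```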

CSV Example

A csv representation of the iris dataframe:

iris |> head(5)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa

When saved to a csv file:

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5,3.6,1.4,0.2,setosa

CSV files

Write and read csv

Many packages support writing/reading csv files:

  • base R (utils package): basic but slow
  • readr from tidyverse: fast, functional
  • vroom, data.table: extremely fast and functional

We use the readr package and will discuss I/O speed later.

CSV files

To write a data.frame to a csv file: write_csv()

# saves csv file on current working directory
iris |> write_csv('iris_file.csv') 
write_csv(iris, 'iris_file.csv') # same

To read a .csv file to a data.frame: read_csv()

my_iris <- read_csv('iris_file.csv')
Rows: 150 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Species
dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(my_iris)
# A tibble: 6 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <chr>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa 
4          4.6         3.1          1.5         0.2 setosa 
5          5           3.6          1.4         0.2 setosa 
6          5.4         3.9          1.7         0.4 setosa 

You can specify a downloadable URL instead:

url <- "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"
test_df <- read_csv(url)
Rows: 398 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): origin, name
dbl (7): mpg, cylinders, displacement, horsepower, weight, acceleration, mod...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test_df |> head()
# A tibble: 6 × 9
    mpg cylinders displacement horsepower weight acceleration model_year origin
  <dbl>     <dbl>        <dbl>      <dbl>  <dbl>        <dbl>      <dbl> <chr> 
1    18         8          307        130   3504         12           70 usa   
2    15         8          350        165   3693         11.5         70 usa   
3    18         8          318        150   3436         11           70 usa   
4    16         8          304        150   3433         12           70 usa   
5    17         8          302        140   3449         10.5         70 usa   
6    15         8          429        198   4341         10           70 usa   
# ℹ 1 more variable: name <chr>

xlsx files

To read a .xlsx file to a data.frame: read_excel() from the readxl package

exxon_statement <- readxl::read_excel(
  'data/financial-statements-exxon-mobil.xlsx',
  sheet = 1, # first sheet
  skip  = 1
)
head(exxon_statement)
# A tibble: 6 × 11
  `in million USD`         `FY '09` `FY '10` `FY '11` `FY '12` `FY '13` `FY '14`
  <chr>                       <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 Net Income/Starting Line    19658    31398    42206    47681    33448    33615
2 Depreciation & Amortiza…    11917    14760    15583    15888    17182    17297
3 Non-Cash Items              -1474      129      262     3313      482     3110
4 Change in Working Capit…    -1663     2126    -2706   -10712    -6198    -8906
5 Cash from Operating Act…    28438    48413    55345    56170    44914    45116
6 Change in Fixed Assets …   -22491   -26871   -30975   -34271   -33669   -32952
# ℹ 4 more variables: `FY '15` <dbl>, `FY '16` <dbl>, `FY '17` <dbl>,
#   `FY '18` <dbl>

Other data formats

There are other common data formats:

  • ‘.dat’, ‘.sas7bdat’ (SAS), ‘.dta’ (Stata)
  • ‘.rds’: R’s native serialized format
  • ‘.json’: flexible nested data (common in NoSQL and web APIs)
  • ‘.parquet’, ‘.feather’: fast columnar formats (read with arrow)
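Of these, .rds is the simplest to demonstrate, since base R can write and read it without extra packages (a minimal sketch; the path here is a throwaway temp file):

```r
# .rds preserves an R object exactly across a save/load round trip
df <- data.frame(ticker = c("XOM", "AAPL"), price = c(105.2, 192.5))

path <- tempfile(fileext = ".rds")
saveRDS(df, path)
df_back <- readRDS(path)

identical(df, df_back)  # TRUE

# The other formats have dedicated readers (assuming the packages
# are installed): haven::read_sas(), haven::read_dta(),
# jsonlite::fromJSON(), arrow::read_parquet(), arrow::read_feather()
```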

Data cleaning

The janitor package offers a simple variable-name cleaner: clean_names().

exxon_statement <- exxon_statement |> 
  janitor::clean_names()
exxon_statement |> head()
# A tibble: 6 × 11
  in_million_usd   fy_09  fy_10  fy_11  fy_12  fy_13  fy_14  fy_15  fy_16  fy_17
  <chr>            <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Net Income/Sta…  19658  31398  42206  47681  33448  33615  16551   8375  19848
2 Depreciation &…  11917  14760  15583  15888  17182  17297  18048  22308  19893
3 Non-Cash Items   -1474    129    262   3313    482   3110   -750  -5313  -7921
4 Change in Work…  -1663   2126  -2706 -10712  -6198  -8906  -3505  -3288  -1754
5 Cash from Oper…  28438  48413  55345  56170  44914  45116  30344  22082  30066
6 Change in Fixe… -22491 -26871 -30975 -34271 -33669 -32952 -26490 -16163 -15402
# ℹ 1 more variable: fy_18 <dbl>
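Under the hood, clean_names() applies janitor::make_clean_names() to the column names; you can call it directly on a character vector to preview the renaming (assuming janitor is installed):

```r
# Preview janitor's renaming rules on raw names:
# lowercase, snake_case, punctuation stripped
janitor::make_clean_names(c("in million USD", "FY '09"))
# [1] "in_million_usd" "fy_09"
```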

Exercise 1: Read

  1. Install and load new packages: janitor and readxl
  2. Download the Exxon Mobil data to your directory.
  3. Read the statement file with read_excel() and store as exxon_statement
  4. Clean the variable names with clean_names().

Exercise 2: Join

  1. Read exxon_categories.csv file with read_csv() as exxon_categories.
  2. Left join exxon_categories to exxon_statement as:
exxon_statement <- exxon_statement |> 
  left_join(
    exxon_categories, 
    # Your work here
  )

Exercise 3: Cleaning and Tidying

  1. Rename variable in_million_usd to Account.
  2. Pivot to a long form and store as exxon_long with the arguments below:
  • use starts_with()
  • names_to = "Year"
  • names_prefix = "fy_"
  • values_to = "M_USD"

Grammar of Graphics (ggplot)

Case 1: Single Categorical variable

Categorical variables (factors) take a predefined set of values.

  • Ordered: size (large, medium, …), grades (A, A-, …)
  • Unordered: country, ethnicity
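These correspond to ordered and unordered factors in R (a base-R sketch):

```r
# Unordered factor: levels have no inherent ranking
country <- factor(c("USA", "Korea", "USA"))

# Ordered factor: levels carry a ranking, so comparisons work
size <- factor(c("medium", "large", "small"),
               levels  = c("small", "medium", "large"),
               ordered = TRUE)

levels(size)        # "small" "medium" "large"
size[1] > size[3]   # TRUE: medium > small
is.ordered(country) # FALSE
```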

To visualize the distribution of a categorical variable, bar plots are often used.

geom_bar() and geom_col()

geom_bar(): when you want the count of rows in each category (ggplot computes the bar heights)

geom_col(): when you supply the bar heights yourself via a y aesthetic

modified_exxon <- read_csv("https://raw.githubusercontent.com/matthewgson/public_data/refs/heads/main/data/modified_exxon.csv")
Rows: 170 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Account, Category, Year
dbl (1): M_USD

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Example 1: geom_bar()
modified_exxon |> 
  ggplot(aes(x = Category)) +
  geom_bar()

# Example 2: geom_col()
total_amt_bycat <- modified_exxon |> 
  summarize(Total_USD = sum(M_USD, na.rm = TRUE), .by = c(Category))


total_amt_bycat |> 
  ggplot(aes(x = Category, y = Total_USD)) +
  geom_col()

Effective visualization

The end goal of the bar plot:

Adding color to Bar plots

To set the fill color, use fill. For the border color, use color.

total_amt_bycat |> 
  ggplot(aes(x = Category, y = Total_USD)) +
  geom_col(fill = "lightblue")

total_amt_bycat |> 
  ggplot(aes(x = Category, y = Total_USD)) +
  geom_col(fill = "lightblue", color = "black")

Scaling Y axis

To change scientific notation to a currency format, use the scales package.

total_amt_bycat |> 
  ggplot(aes(x = Category, y = Total_USD)) +
  geom_col(fill = "lightblue", color = "black") +
  scale_y_continuous(
    labels = scales::label_currency(prefix = "$", suffix = "T", scale = 0.001)
  )

More ticks

To add more breaks (ticks) on the Y axis, set n.breaks in the continuous scale.

total_amt_bycat |> 
  ggplot(aes(x = Category, y = Total_USD)) +
  geom_col(fill = "lightblue", color = "black") +
  scale_y_continuous(
    n.breaks = 6,
    labels = scales::dollar_format(prefix = "$", suffix = "T", scale = 0.001)
  )

Axis Labels

To add axis labels and a plot title, use the labs() function.

total_amt_bycat |> 
  ggplot(aes(x = Category, y = Total_USD)) +
  geom_col(fill = "lightblue", color = "black") +
  scale_y_continuous(
    n.breaks = 6,
    labels = scales::dollar_format(prefix = "$", suffix = "T", scale = 0.001)
  ) +
  labs(
    title = "Exxon Mobil Total Revenue / Expenditure",
    subtitle = "From Years 2009 - 2018",
    x = "",
    y = "Total Revenue / Expenditure ($)",
    caption = "Source: Exxon Financial Statements"
  )

Adding Themes

By default, ggplot uses a gray background. To change it, you can use other built-in themes.

total_amt_bycat |> 
  ggplot(aes(x = Category, y = Total_USD)) +
  geom_col(fill = "lightblue", color = "black") +
  scale_y_continuous(
    n.breaks = 6,
    labels = scales::dollar_format(prefix = "$", suffix = "T", scale = 0.001)
  ) +
  labs(
    title = "Exxon Mobil Total Revenue / Expenditure",
    subtitle = "From Years 2009 - 2018",
    x = "",
    y = "Total Revenue / Expenditure ($)",
    caption = "Source: Exxon Financial Statements"
  ) +
  theme_bw()

Reordering by value

To reorder a factor variable according to its value, apply fct_reorder() to the variable.

total_amt_bycat |> 
  ggplot(aes(x = fct_reorder(Category, -Total_USD) , y = Total_USD)) +
  geom_col(fill = "lightblue", color = "black") +
  scale_y_continuous(
    n.breaks = 6,
    labels = scales::dollar_format(prefix = "$", suffix = "T", scale = 0.001)
  ) +
  labs(
    title = "Exxon Mobil Total Expenditure",
    subtitle = "From Years 2009 - 2018",
    x = "",
    y = "Total Expenditure ($)",
    caption = "Source: Exxon Financial Statements"
  ) +
  theme_bw()

Case 2: Single Numeric variable

A numeric variable can take an infinite range of values.

  • In finance: risk, returns, ratios (e.g., ROA)

A commonly used visualization is the histogram.

  • Proper bin size is important for effective visualization.
stock_returns <- tq_get(c("AAPL", "KO"), from = "2020-01-01", to = "2023-12-31") |> 
  group_by(symbol) |> 
  arrange(symbol, date) |> 
  mutate(pct_ret = (adjusted / lag(adjusted) - 1) * 100) |> 
  ungroup() |> 
  select(symbol, date, pct_ret)

stock_returns |> 
  filter(symbol == "AAPL") |> 
  ggplot(aes(x = pct_ret)) +
  geom_histogram() # default bins: 30
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).

stock_returns |> 
  filter(symbol == "AAPL") |> 
  ggplot(aes(x = pct_ret))+
  geom_histogram(bins = 10)
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).

stock_returns |> 
  filter(symbol == "AAPL") |> 
  ggplot(aes(x = pct_ret))+
  geom_histogram(bins = 50)
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).


An alternative for numeric distributions is the density plot.

  • Uses kernel smoothing
  • Bandwidth (bw) choice is important
stock_returns |> 
  ggplot(aes(x = pct_ret))+
  geom_density() # auto
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_density()`).

stock_returns |> 
  ggplot(aes(x = pct_ret))+
  geom_density(bw = 0.01)
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_density()`).

stock_returns |> 
  ggplot(aes(x = pct_ret))+
  geom_density(bw = 0.1)
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_density()`).

Case 3: Numeric + Categorical

Visualizing two (or more) variables together helps show relationships between them.

  • Boxplot, Violin, Density plots
stock_returns |> 
  ggplot(aes(x = symbol, y = pct_ret)) +
  geom_boxplot()
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

stock_returns |> 
  ggplot(aes(x = symbol, y = pct_ret)) +
  geom_violin()
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_ydensity()`).

stock_returns |> 
  ggplot(aes(x = pct_ret, fill = symbol, color = symbol)) +
  geom_density(alpha = 0.5)
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_density()`).

Case 4: Two categorical

Barplots can be used to visualize two categorical variables.

  • Use the fill aesthetic to add one more dimension
modified_exxon |> 
  group_by(Year, Category) |> 
  summarize(Total_M_USD = sum(M_USD)) |> 
  ggplot(aes(x=Year, y = Total_M_USD, fill = Category)) +
  geom_col() 
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_col()`).

modified_exxon |> 
  group_by(Year, Category) |> 
  summarize(Total_M_USD = sum(M_USD)) |> 
  ggplot(aes(x=Year, y = Total_M_USD, fill = Category)) +
  geom_col() +
  scale_fill_brewer(palette ='Set1')
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_col()`).

modified_exxon |> 
  group_by(Year, Category) |> 
  summarize(Total_M_USD = sum(M_USD)) |> 
  ggplot(aes(x=Year, y = Total_M_USD, fill = Category)) +
  geom_col(position = "dodge") +
  scale_fill_brewer(palette ='Set1')
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_col()`).

Case 5: Two numerical

Scatterplots and regressions can effectively visualize relationships.

stock_returns |> 
  pivot_wider(names_from = symbol, values_from = pct_ret) |> 
  ggplot(
    aes(x = AAPL, y = KO)
  ) +
  geom_point()
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

stock_returns |> 
  pivot_wider(names_from = symbol, values_from = pct_ret) |> 
  ggplot(
    aes(x = AAPL, y = KO)
  ) +
  geom_point() +
  geom_smooth(method = 'lm')
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 1 row containing non-finite outside the scale range (`stat_smooth()`).
Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

Case 6: Time series

Time series plots have a date variable on the X axis.

Line plots are often used. (Below, cumulative return is approximated by summing daily percentage returns.)

stock_returns |> 
  drop_na() |> 
  group_by(symbol) |> 
  arrange(symbol, date) |> 
  mutate(cumret = cumsum(pct_ret)) |> 
  filter(
    symbol == "AAPL", 
    year(date) == 2020
  ) |> 
  ggplot(aes(x = date, y = cumret)) +
  geom_line() + 
  theme_minimal()

stock_returns |> 
  drop_na() |> 
  group_by(symbol) |> 
  arrange(symbol, date) |> 
  mutate(cumret = cumsum(pct_ret)) |> 
  filter(
    year(date) == 2020
  ) |> 
  ggplot(aes(x = date, y = cumret, color = symbol)) +
  geom_line() +
  theme_minimal()

Time series: Candlestick

Candlestick price charts are often used as well:

# Plot candlestick chart
stock_price <- tq_get("AAPL", from = "2023-12-01", to = "2023-12-31")
stock_price |> 
  ggplot(aes(x = date, open = open, high = high, low = low, close = close)) +
  geom_candlestick() +
  labs(
    title = "AAPL Candlestick Chart, December 2023", 
    x = "Date", 
    y = "Price (USD)") +
  theme_bw() 

stock_price |>
  # as.factor(date) makes the axis discrete, removing gaps
  # left by non-trading days (weekends, holidays)
  ggplot(aes(x = as.factor(date), open = open, high = high, low = low, close = close)) +
  geom_candlestick() +
  labs(
    title = "AAPL Candlestick Chart, December 2023", 
    x = "Date", 
    y = "Price (USD)") +
  theme_bw() +
  scale_x_discrete(
    # label every third trading day to avoid crowded axis text
    breaks = as.factor(stock_price$date[seq(1, length(stock_price$date), 3)])
  )